Section 1-4 - Building Pipelines

GridSearchCV reviews the performance of a set range of parameters on a cross-validation basis. This means only a portion of the training data is reviewed at any one time. When filling in the NA values with the mean value, however, we considered the whole set of training data.

Hence we took an inconsistent approach in reviewing only a portion of the data when running GridSearchCV, but the full set of data when filling in missing values. We can avoid this inconsistency by building pipelines and making imputations.

Pandas - Extracting data


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

Pandas - Cleaning data

We will leave the NA values in the column Age.


In [2]:
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

age_mean = df['Age'].mean()

from scipy.stats import mode

mode_embarked = mode(df['Embarked'])[0][0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

pd.get_dummies(df['Embarked'], prefix='Embarked').head(10)
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]

df = df[cols]

We replace the NA values in the column Age with a negative value marker -1, as the following bug disallows us from using a missing value marker:

https://github.com/scikit-learn/scikit-learn/issues/3044


In [3]:
df = df.fillna(-1)

We then review our dataset.


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 11 columns):
Survived       891 non-null int64
PassengerId    891 non-null int64
Pclass         891 non-null int64
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Gender         891 non-null int64
Embarked_C     891 non-null float64
Embarked_Q     891 non-null float64
Embarked_S     891 non-null float64
dtypes: float64(5), int64(6)

In [5]:
train_data = df.values

Scikit-learn - Training the model

We now build a pipeline to enable us to first impute the mean value of the column Age on the portion of the training data we are considering, and second, assess the performance of our tuning parameters.


In [6]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV

imputer = Imputer(strategy='mean', missing_values=-1)

classifier = RandomForestClassifier(n_estimators=100)

pipeline = Pipeline([
    ('imp', imputer),
    ('clf', classifier),
])

We note the slight change made to the syntax inside our parameter grid.


In [7]:
parameter_grid = {
    'clf__max_features': [0.5, 1],
    'clf__max_depth': [5, None],
}

We now run GridSearchCV as before but replacing the classifier with our pipeline.


In [8]:
grid_search = GridSearchCV(pipeline, parameter_grid, cv=5, verbose=3)

In [9]:
grid_search.fit(train_data[0::,1::], train_data[0::,0])


Fitting 5 folds for each of 4 candidates, totalling 20 fits
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5 .........................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, score=0.826816 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5 .........................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, score=0.803371 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5 .........................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, score=0.808989 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5 .........................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, score=0.837079 -   0.3s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=5 .........................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=5, score=0.837079 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=5 ...........................
[GridSearchCV] .. clf__max_features=1, clf__max_depth=5, score=0.843575 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5 ...........................
[GridSearchCV] .. clf__max_features=1, clf__max_depth=5, score=0.792135 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5 ...........................
[GridSearchCV] .. clf__max_features=1, clf__max_depth=5, score=0.775281 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5 ...........................
[GridSearchCV] .. clf__max_features=1, clf__max_depth=5, score=0.870787 -   0.2s
[GridSearchCV] clf__max_features=1, clf__max_depth=5 ...........................
[GridSearchCV] .. clf__max_features=1, clf__max_depth=5, score=0.831461 -   0.2s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None ......................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, score=0.826816 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None ......................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, score=0.803371 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None ......................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, score=0.831461 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None ......................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, score=0.820225 -   0.4s
[GridSearchCV] clf__max_features=0.5, clf__max_depth=None ......................
[GridSearchCV]  clf__max_features=0.5, clf__max_depth=None, score=0.842697 -   0.4s
[GridSearchCV] clf__max_features=1, clf__max_depth=None ........................
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, score=0.843575 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None ........................
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, score=0.792135 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None ........................
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, score=0.780899 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None ........................
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, score=0.814607 -   0.3s
[GridSearchCV] clf__max_features=1, clf__max_depth=None ........................
[GridSearchCV]  clf__max_features=1, clf__max_depth=None, score=0.837079 -   0.3s
[Parallel(n_jobs=1)]: Done   1 jobs       | elapsed:    0.3s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:    5.8s finished

Out[9]:
GridSearchCV(cv=5,
       estimator=Pipeline(steps=[('imp', Imputer(axis=0, copy=True, missing_values=-1, strategy='mean', verbose=0)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0))]),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'clf__max_features': [0.5, 1], 'clf__max_depth': [5, None]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=3)

In [10]:
sorted(grid_search.grid_scores_, key=lambda x: x.mean_validation_score)
grid_search.best_score_
grid_search.best_params_


Out[10]:
{'clf__max_depth': None, 'clf__max_features': 0.5}

Now that we've determined the desired values for our tuning parameters, we can fill in the -1 values in the column Age with the mean and train our model.


In [11]:
df['Age'].describe()


Out[11]:
count    891.000000
mean      23.600640
std       17.867496
min       -1.000000
25%        6.000000
50%       24.000000
75%       35.000000
max       80.000000
dtype: float64

In [12]:
df['Age'] = df['Age'].map(lambda x: age_mean if x == -1 else x)

In [13]:
df['Age'].describe()


Out[13]:
count    891.000000
mean      29.699118
std       13.002015
min        0.420000
25%       22.000000
50%       29.699118
75%       35.000000
max       80.000000
dtype: float64

In [14]:
train_data = df.values

In [15]:
model = RandomForestClassifier(n_estimators = 100, max_features=0.5, max_depth=5)
model = model.fit(train_data[0:,2:],train_data[0:,0])

Scikit-learn - Making predictions


In [16]:
df_test = pd.read_csv('../data/test.csv')

df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

We can fill in the NA values in test data with the mean, since there is no analogous problem of snooping.


In [17]:
df_test['Age'] = df_test['Age'].fillna(age_mean)

In [18]:
fare_means = df.pivot_table('Fare', index='Pclass', aggfunc='mean')
df_test['Fare'] = df_test[['Fare', 'Pclass']].apply(lambda x:
                            fare_means[x['Pclass']] if pd.isnull(x['Fare'])
                            else x['Fare'], axis=1)

df_test['Gender'] = df_test['Sex'].map({'female': 0, 'male': 1}).astype(int)
df_test = pd.concat([df_test, pd.get_dummies(df_test['Embarked'], prefix='Embarked')],
                axis=1)

df_test = df_test.drop(['Sex', 'Embarked'], axis=1)

test_data = df_test.values

output = model.predict(test_data[:,1:])


/Users/savarin/anaconda/envs/py27/lib/python2.7/site-packages/pandas/core/index.py:503: FutureWarning: scalar indexers for index type Int64Index should be integers and not floating point
  type(self).__name__),FutureWarning)

Pandas - Preparing for submission


In [19]:
result = np.c_[test_data[:,0].astype(int), output.astype(int)]

df_result = pd.DataFrame(result[:,0:2], columns=['PassengerId', 'Survived'])
df_result.to_csv('../results/titanic_1-4.csv', index=False)